
Exact natural gradient in deep linear networks and its application to the nonlinear case

Neural Information Processing Systems

Stochastic gradient descent (SGD) remains the method of choice for deep learning, despite the limitations arising for ill-behaved objective functions. In cases where it could be estimated, the natural gradient has proven very effective at mitigating the catastrophic effects of pathological curvature in the objective function, but little is known theoretically about its convergence properties, and it has yet to find a practical implementation that would scale to very deep and large networks. Here, we derive an exact expression for the natural gradient in deep linear networks, which exhibit pathological curvature similar to the nonlinear case. We provide for the first time an analytical solution for its convergence rate, showing that the loss decreases exponentially to the global minimum in parameter space. Our expression for the natural gradient is surprisingly simple, computationally tractable, and explains why some approximations proposed previously work well in practice. This opens new avenues for approximating the natural gradient in the nonlinear case, and we show in preliminary experiments that our online natural gradient descent outperforms SGD on MNIST autoencoding while sharing its computational simplicity.
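For orientation, the natural gradient update can be written in its standard general form as below. This is the textbook definition, not the paper's closed-form expression for deep linear networks; the symbols $\theta$ (parameters), $\eta$ (learning rate), $\mathcal{L}$ (loss) and $F$ (Fisher information matrix) are notation assumed here rather than taken from the abstract.

```latex
\theta_{t+1} = \theta_t - \eta \, F(\theta_t)^{-1} \, \nabla_\theta \mathcal{L}(\theta_t),
\qquad
F(\theta) = \mathbb{E}_{x,\, y \sim p(y \mid x;\, \theta)}
\left[ \nabla_\theta \log p(y \mid x; \theta) \, \nabla_\theta \log p(y \mid x; \theta)^{\top} \right]
```

When $F$ is singular, as happens in over-parameterized models, a generalized inverse is typically used in place of $F^{-1}$.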


Reviews: Exact natural gradient in deep linear networks and its application to the nonlinear case

Neural Information Processing Systems

The main result is that the natural gradient completely removes the pathological curvature introduced by depth, yielding exponential convergence in the total weights (as though it were a shallow network). The paper traces connections to a variety of previous methods for approximating the Fisher information matrix, and shows a preliminary application of the method to nonlinear networks (for which it is no longer exact), where it appears to speed up convergence.

Major comments: This paper presents an elegant analysis of learning dynamics under the natural gradient. Even though the results are obtained for deep linear networks, they are decisive for this case and strongly suggest that future work in this direction could bring principled benefits for the nonlinear case (as shown at small scale in the nonlinear autoencoder experiment). The analysis provides solid intuitions for prior work on approximating second-order methods, including an interesting observation on the structure of the Hessian: it is far from block diagonal, even though block diagonality is a common assumption in prior work. Yet the off-diagonal blocks are repeats of the diagonal blocks, which is why assuming a block-diagonal structure yields similar results.
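The claim that the total weights converge exponentially "as though it were a shallow network" can be illustrated with a minimal sketch. This is a toy constructed for illustration, not the paper's derivation. It assumes a scalar deep linear network whose output is the product of its weights times a unit-variance input, with squared-error loss. For that model the Fisher matrix is rank one, so the sketch uses its Moore-Penrose pseudo-inverse. The function name `natural_gradient_chain` and all parameter values are hypothetical.

```python
import math


def natural_gradient_chain(depth, target, init=0.9, lr=0.1, steps=150):
    """Natural gradient descent on a toy scalar chain y = (w_1 * ... * w_depth) * x.

    Assumes unit-variance inputs and squared-error loss, so that:
      loss gradient  = error * g, where g_i = d(product)/d(w_i)
      Fisher matrix  = g g^T  (rank one)
      pseudo-inverse = g g^T / |g|^4
    which gives the update  w_i <- w_i - lr * error * g_i / |g|^2.
    Returns the absolute error of the total weight before each step.
    """
    weights = [init] * depth
    errors = []
    for _ in range(steps):
        product = math.prod(weights)
        error = product - target
        errors.append(abs(error))
        # g_i is the product of all weights except the i-th one.
        grads = [math.prod(weights[:i] + weights[i + 1:]) for i in range(depth)]
        norm_sq = sum(g * g for g in grads)
        if norm_sq == 0.0:
            break  # pseudo-inverse direction is zero; nothing to update
        weights = [w - lr * error * g / norm_sq for w, g in zip(weights, grads)]
    return errors


if __name__ == "__main__":
    for depth in (2, 6):
        errors = natural_gradient_chain(depth, target=2.0)
        # Late in training the per-step error ratio approaches 1 - lr for any depth.
        print(depth, errors[-1] / errors[-2])
```

To first order in the learning rate, each step shrinks the error of the total weight by the factor `1 - lr` regardless of `depth`, which is the depth-independent exponential decay the review describes. Plain gradient descent on the same chain would instead have a rate that depends on the current weight magnitudes.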


Exact natural gradient in deep linear networks and its application to the nonlinear case

Bernacchia, Alberto, Lengyel, Mate, Hennequin, Guillaume

Neural Information Processing Systems
